
Remove HF_TOKEN dependency in E2E test #357


Merged: jack8558 merged 21 commits into main from jackoh/remove-hf-token-in-e2e-test on Aug 15, 2025

Conversation

jack8558
Collaborator

@jack8558 jack8558 commented Aug 5, 2025

Removing HF_TOKEN dependency in E2E test

  • Created tp save_hf_model_files_to_gcs to save Hugging Face model files in GCS
  • Saved the tokenizers and Llama-3-8B's model weights and configs in a GCS bucket (the weights and configs are needed for the SFT E2E test)
  • Since the Hugging Face libraries can't load directly from GCS, added a util function copy_gcs_to_local that downloads the files to a temporary directory
  • Removed HF_TOKEN from the E2E test and the CPU test

tp save_hf_model_files_to_gcs example:

tp save-hf-model-files-to-gcs \
  --repo-id "meta-llama/Meta-Llama-3-8B" \
  --gcs-path "gs://bucket" \
  --file-type "all" \
  --temp-dir /mnt/disks/tmp
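
For context, a minimal sketch of what such a command could do internally (a hypothetical helper, not the code merged in this PR; assumes huggingface_hub and gsutil are available): download the files from the Hub once with a token, then push them to the bucket.

import os
import subprocess

from huggingface_hub import snapshot_download


def save_hf_model_files_to_gcs(repo_id: str, gcs_path: str,
                               file_type: str = "all",
                               temp_dir: str = "/tmp") -> None:
    """Hypothetical sketch: mirror Hugging Face model files into a GCS bucket.

    Runs once with a valid HF token so that later test runs can read the
    files from GCS without needing HF_TOKEN.
    """
    # Restrict which files are pulled from the Hub; "all" mirrors everything.
    # The pattern mapping here is illustrative only.
    patterns = None if file_type == "all" else ["tokenizer*", "*.json"]
    local_dir = snapshot_download(
        repo_id=repo_id,
        allow_patterns=patterns,
        local_dir=os.path.join(temp_dir, repo_id.split("/")[-1]),
        token=os.environ.get("HF_TOKEN"),  # only needed for this one-time export
    )
    # Copy the downloaded snapshot into the bucket with gsutil.
    subprocess.run(["gsutil", "-m", "cp", "-r", local_dir, gcs_path], check=True)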

#14

@jack8558 jack8558 changed the title from DRAFT to Remove HF_TOKEN dependency in E2E test Aug 5, 2025
@jack8558 jack8558 linked an issue Aug 8, 2025 that may be closed by this pull request
@jack8558 jack8558 marked this pull request as ready for review August 8, 2025 15:31
@jialei777
Collaborator

Thank you for putting this together. My question: since the models and tokenizers are gated (and under certain terms set by Meta), are they allowed to be saved in a GCP bucket and distributed by us?

@vlasenkoalexey
Collaborator

> Thank you for putting this together. My question: since the models and tokenizers are gated (and under certain terms set by Meta), are they allowed to be saved in a GCP bucket and distributed by us?

Putting weights up for our own use in E2E tests is fine; distributing weights and tokenizers publicly is not.

@@ -153,8 +155,13 @@ def _maybe_save_checkpoint(self, config: DictConfig) -> None:
    # Step 3: Save the HF config files and tokenizer
    if xr.process_index() == 0:
        logger.info("Saving Hugging Face configs and tokenizer to %s", save_dir)
        model_utils.copy_hf_config_files(config.model.pretrained_model, save_dir)
        model_utils.save_hf_tokenizer(config.model.pretrained_model, save_dir)
        # Copy to local if in GCS
Collaborator

Could you explain why this is necessary?
If training started from a gcsfuse mount, which looks like a local folder, would it still try to copy?

Collaborator Author

This was needed because the GCS bucket we load the tokenizer from is not mounted by gcsfuse.

The bucket we mount in thunk.py is artifact_dir. The implementation in this PR copies GCS content to local storage using gsutil instead of relying on gcsfuse.
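
A minimal sketch of what a gsutil-based copy_gcs_to_local could look like under these assumptions (gsutil available on the host; the exact signature in the PR may differ). It also shows why a gcsfuse-mounted path, which has no gs:// prefix, would not trigger a copy:

import os
import subprocess
import tempfile


def copy_gcs_to_local(path: str) -> str:
    """Illustrative sketch: download a gs:// path into a local temp directory.

    Paths without a gs:// prefix (e.g. a gcsfuse mount, which already looks
    like a local folder) are returned unchanged, so nothing is copied.
    """
    if not path.startswith("gs://"):
        return path
    local_dir = tempfile.mkdtemp()
    # Recursively copy the bucket contents so HF libraries can read from disk.
    subprocess.run(["gsutil", "-m", "cp", "-r", path, local_dir], check=True)
    # gsutil cp -r creates a subdirectory named after the last path component.
    return os.path.join(local_dir, path.rstrip("/").split("/")[-1])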

)

local_dir = tempfile.mkdtemp()
_TEMP_DIRS_TO_CLEAN.append(local_dir)
Collaborator

This is a bad pattern; could you make the temp dir an argument, or use a context manager to auto-clean it?
If that's inconvenient, feel free to leave it as is.

Collaborator Author

Updated the function with a context manager. Let me know if this looks better.
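
For reference, one way the context-manager form could look (a sketch under the same gsutil assumption, not necessarily the exact code merged here); the temporary copy is removed automatically when the with block exits:

import contextlib
import os
import subprocess
import tempfile


@contextlib.contextmanager
def copy_gcs_to_local(path: str):
    """Illustrative sketch: yield a local copy of a gs:// path, clean up on exit."""
    if not path.startswith("gs://"):
        # Already local (e.g. a gcsfuse mount): nothing to copy or clean up.
        yield path
        return
    with tempfile.TemporaryDirectory() as tmp_dir:
        subprocess.run(["gsutil", "-m", "cp", "-r", path, tmp_dir], check=True)
        # gsutil cp -r creates a subdirectory named after the last path component.
        yield os.path.join(tmp_dir, path.rstrip("/").split("/")[-1])
    # TemporaryDirectory removes everything on exit, so no manual cleanup list is needed.

Usage would then be, for example, with copy_gcs_to_local("gs://bucket/tokenizers") as local_path: and the Hugging Face loaders read from local_path.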

@jack8558 jack8558 merged commit 3a1b818 into main Aug 15, 2025
27 of 29 checks passed
@jack8558 jack8558 deleted the jackoh/remove-hf-token-in-e2e-test branch August 15, 2025 03:29
Development

Successfully merging this pull request may close these issues.

Remove dependency on HuggingFace token
3 participants